Addressing scaling challenges in comparative genomics
Abstract
Comparative genomics is essentially a form of data mining in large collections of n-ary relations between genomic elements. Increases in the number of sequenced genomes create a stress on comparative genomics that grows, at worst geometrically, with every increase in sequence data. Even modestly sized labs now routinely obtain several genomes at a time and, like large consortia, expect to be able to perform all-against-all analyses as part of these new multi-genome strategies. To address these needs at all levels, it is necessary to rethink the algorithmic frameworks and data storage technologies used for comparative genomics. To meet these challenges of scale, in this thesis we develop novel methods based on NoSQL and MapReduce technologies. Using a characterization of the kinds of data used in comparative genomics, and a study of usage patterns for their analysis, we define a practical formalism for genomic Big Data, implement it using the Cassandra NoSQL platform, and evaluate its performance. Furthermore, using two quite different global analyses in comparative genomics, we define two strategies for adapting these applications to the MapReduce paradigm and derive new algorithms. For the first, identifying gene fusion and fission events in phylogenies, we reformulate the problem as a bounded parallel traversal that avoids high-latency graph-based algorithms. For the second, consensus clustering to identify protein families, we define an iterative sampling procedure that quickly converges to the desired global result. We implement both new algorithms on the Hadoop MapReduce platform and evaluate their performance. The performance is competitive and scales much better than that of existing solutions, but it requires particular effort, now and in future work, in devising specific algorithms.
tel-00865840, version 1 - 25 Sep 2013
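As a rough illustration of the scaling pressure described in the abstract, the sketch below counts the comparisons in an all-against-all analysis. It assumes that "all-against-all" means comparing every unordered pair of genomes exactly once. That reading, the function name pairwise_comparisons and the placeholder genome labels are assumptions made for this example and are not taken from the thesis.

```python
from itertools import combinations


def pairwise_comparisons(n_genomes: int) -> int:
    """Number of unordered genome pairs in an all-against-all analysis."""
    return n_genomes * (n_genomes - 1) // 2


# Cross-check the closed form against explicit enumeration
# on a small collection of placeholder (hypothetical) genome labels.
genomes = [f"genome_{i}" for i in range(6)]
assert len(list(combinations(genomes, 2))) == pairwise_comparisons(len(genomes))

# The pair count grows quadratically with the number of genomes.
for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n))
```

Under this assumption, a tenfold increase in the number of genomes gives roughly a hundredfold increase in pairwise work.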
Similar resources
Addressing the Omics Data Explosion: a Comprehensive Reference Genome Representation and the Democratization of Comparative Genomics and Immunogenomics
Comparative genomics of human stem cell factor (SCF)
Stem cell factor (SCF) is a critical protein with key roles in the cell such as hematopoiesis, gametogenesis and melanogenesis. In the present study a comparative analysis on nucleotide sequences of SCF was performed in Humanoids using bioinformatics tools including NCBI-BLAST, MEGA6, and JBrowse. Our analysis of nucleotide sequences to find closely evolved organisms with high similarity by NCB...
Review of Techniques for Gene Sequencing, Annotation and Comparative Genomics
The availability of complete genome sequences for many organisms has made comparative analysis of genes a new field of research. The daily explosion in sequenced genome data has made this task an enormous one. Several techniques and methods have been devised and applied to carry out genome comparison. In this work, we surveyed and presented an overview of common methods, techniques, tools and cha...
Applications of hidden Markov models for comparative gene structure prediction
Identifying the structure in genome sequences is one of the principal challenges in modern molecular biology, and comparative genomics offers a powerful tool. In this paper we introduce a hidden Markov model that allows a comparative analysis of multiple sequences related by a phylogenetic tree. The model integrates structure prediction methods for one sequence, statistical multiple alignment m...
Addressing NCDs: Penetration of the Producers of Hazardous Products into Global Health Environment Requires a Strong Response; Comment on “Addressing NCDs: Challenges From Industry Market Promotion and Interferences”
Timely warnings and examples of industry interference in relation to tobacco, alcohol, food and breast milk substitutes are given in the editorial by Tangcharoensathien et al. Such interference is rife at national levels and also at the global level. In an era of ‘private public partnerships’ the alcohol and food industries have succeeded in insinuating themselves into the global health environ...
Publication date: 2013